Final Project

Final Project
Author

Ryan Zomorrodi

Published

April 25, 2024

Introduction

The following project is a replication attempt of the machine learning model described in Huynh et al. [1] predicting lead exposure from drinking water within the city of Chicago. Lead-contaminated drinking water at the block level was defined as a binary variable indicating whether the majority of tests within a block have at least 1 ppb lead concentration. According to their model, approximately 75% of blocks are estimated to have lead-contaminated drinking water. The greatest predictors of lead-contaminated drinking water were geographic areas, population at the block, and number of buildings.

Huynh et al. (and by extension, I) used the following data sources:

Data Sources
Source Measure Extent
City of Chicago Department of Water Management Lead Test Data Consecutive lead tests (ppb) Anonymized to the block
Census Block FIPS Block
Population (#) Block

Race/ethnicity (#)

  • AIAN

  • Asian

  • Black

  • Hispanic

  • White

Block
American Community Survey Block FIPS Block
Block group FIPS Block group
Population (#) Block group, tract

Race/ethnicity (#)

  • AIAN

  • Asian

  • Black

  • Hispanic

  • White

Block group, tract
Housing units (#) Block group
Median house value ($) Block group
Upper house value ($) Block group
Lower house value ($) Block group
Median homeowner costs ($) Block group

Education (#)

  • High school

  • GED

  • <1 year of college

  • > 1 year of college

  • Associate’s degree

  • Bachelor’s degree

  • Master’s degree

  • Professional School

  • Doctoral Degree

Block group
Poverty (#) Block group
English-only speakers (#) Block group
Computer access (#) Block group
Internet access (#) Block group
Complete plumbing facilities (#) Block group
Vacant Housing (#) Block group
Owner-occupied (#) Block group
Children under 5 (#) Block group
Children under 10 (#) Block group
Children under 18 (#) Block group
Chicago Building Footprints Median building age (years) Block
Building (#) Block
Max age (years) Block
Mean age (years) Block
Built after 1986 (#) Block
Chicago Health Atlas Community area Community area
Lead poisoning rate (%) Community area
Historical lead poisoning rate (%) Community area
Economic diversity index Tract
Hardship index Tract
Social vulnerability index Tract
Major crime (#) Tract
Eviction rate (%) Tract
Fine particulate matter concentration (\(\mu\text{g/m}^3\)) Tract
Access to food (%) Tract
Uninsured rate (%) Tract
Per capita income ($) Tract
Cognitive difficulty (%) Tract
Disability (%) Tract
Crowded housing (%) Tract
Rent burdened (%) Tract
Vacant housing (%) Tract

Data

Libraries

Code
library(bonsai)
library(corrr)
library(finetune)
library(future)
library(httr2)
library(jsonlite)
library(leaflet)
library(magrittr)
library(mice)
library(readxl)
library(sf)
library(themis)
library(tidycensus)
library(tidymodels)
library(tidyverse)
library(tigris)

Chicago Boundaries

Download

Download boundaries at various levels: the city, community areas, and census blocks.

Code
chicagoBoundaries <- st_read("https://data.cityofchicago.org/api/geospatial/qqq8-j68g?fourfour=qqq8-j68g&cacheBust=1712775952&date=20240411&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

chicagoCommunityAreas <- st_read("https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

cookBlocks <- blocks(state = "IL", county = "Cook") 

Process

Perform an intersection on census blocks to subset to those blocks that are within Chicago Boundaries. Mutate GEOIDs to create complete block, block group, and tract GEOIDS.

Code
chicagoBlocks <- cookBlocks %>%
    filter(st_intersects(., chicagoBoundaries, sparse = FALSE) %>% unlist()) %>%
    filter(!str_detect(TRACTCE20, "^9900")) %>%
    mutate(GEOID_blk = GEOID20) %>%
    mutate(GEOID_blkGrp = str_sub(GEOID_blk, 1, -4)) %>%
    mutate(GEOID_tract = str_sub(GEOID_blk, 1, -5)) %>% 
    select(starts_with("GEOID_"))

American Community Survey

Download

Get metadata of all the relevant American Community Survey data.

Code
census_metadata <- 
    bind_rows(
        load_variables(2022, "acs5"), 
        load_variables(2020, "pl"))

blk_vars <- c(
    race_blk_total = "P2_001N",
    race_blk_aianNH = "P2_007N",
    race_blk_asianNH = "P2_008N",
    race_blk_blackNH = "P2_006N",
    race_blk_hispanic = "P2_002N",
    race_blk_whiteNH = "P2_005N"
)

blkGrp_vars <- c(
    race_blkGrp_total = "B03002_001",
    race_blkGrp_aianNH = "B03002_005",
    race_blkGrp_asianNH = "B03002_006",
    race_blkGrp_blackNH = "B03002_004",
    race_blkGrp_hispanic = "B03002_012",
    race_blkGrp_whiteNH = "B03002_003",
    housingUnits = "B25001_001",
    lowerValue = "B25076_001",
    medValue = "B25077_001",
    upperValue = "B25078_001",
    homeownerCost = "B25088_001",
    education_total = "B15003_001",
    education_highSchool = "B15003_017",
    education_ged = "B15003_018",
    education_lt1yCollege = "B15003_019",
    education_mt1yCollege = "B15003_020",
    education_associate = "B15003_021",
    education_bachelor = "B15003_022",
    education_master = "B15003_023",
    education_professional = "B15003_024",
    education_doctorate = "B15003_025",
    poverty_total = "B17010_001",
    poverty_belowPovertyLvl = "B17010_002", 
    language_total = "B99162_001",
    language_onlyEnglish = "B99162_002",
    computer_total = "B28003_001",
    computer_hasAComp = "B28003_002",
    internet_total = "B28011_001",
    internet_noInternet = "B28011_008",
    plumbing_total = "B25047_001",
    plumbing_complete = "B25047_002",
    vacancy_total = "B25002_001",
    vacancy_vacant = "B25002_003",
    occupied_total = "B25003_001",
    occupied_owner = "B25003_002",
    age_total = "B01001_001",
    age_maleU5 = "B01001_003",
    age_male5to9 = "B01001_004",
    age_male10to14 = "B01001_005",
    age_male15to17 = "B01001_006",
    age_femaleU5 = "B01001_027",
    age_female5to9 = "B01001_028",
    age_female10to14 = "B01001_029",
    age_female15to17 = "B01001_030")

tract_vars <- c(
    race_tract_total = "B03002_001",
    race_tract_aianNH = "B03002_005",
    race_tract_asianNH = "B03002_006",
    race_tract_blackNH = "B03002_004",
    race_tract_hispanic = "B03002_012",
    race_tract_whiteNH = "B03002_003",
    native_total = "B05002_001",
    native_nativeBrn = "B05002_002")

select_vars_metadata <- census_metadata %>%
    filter(name %in% c(blk_vars, blkGrp_vars, tract_vars))

Download American Community Survey data for all the relevant variables

Code
blk_data <- get_decennial(geography = "block", variables = blk_vars, state = "IL", county = "Cook", output = "wide")
blkGrp_data <- get_acs(geography = "block group", variables = blkGrp_vars, state = "IL", county = "Cook", output = "wide") 
tract_data <- get_acs(geography = "tract", variables = tract_vars, state = "IL", county = "Cook", output = "wide") 

Process

Join datasets and utilize multiple imputation by chained equations with random forest as the regression model.

Code
chicagoACS <- blk_data %>%
    select(-NAME) %>%
    rename(GEOID_blk = GEOID) %>%
    mutate(GEOID_blkGrp = str_sub(GEOID_blk, 1, -4)) %>%
    mutate(GEOID_tract = str_sub(GEOID_blk, 1, -5)) %>%
    relocate(GEOID_blkGrp, GEOID_tract, .after = GEOID_blk) %>%
    left_join(
        blkGrp_data %>%
            select(-ends_with("M"), -NAME) %>%
            mice(meth = "rf", seed = 123) %>%
            complete() %>%
            as_tibble(),
        join_by(GEOID_blkGrp == GEOID)) %>%
    left_join(
        tract_data %>%
            select(-NAME), 
        join_by(GEOID_tract == GEOID)) %>%
    mutate(
        age_U5E = age_maleU5E + age_femaleU5E,
        age_5to9E = age_male5to9E + age_female5to9E,
        age_10to17E = age_male10to14E + age_female10to14E + age_male15to17E + age_male15to17E) %>%
    select(-contains("male"), -ends_with("M")) %>%
    right_join(
        chicagoBlocks %>%
            st_drop_geometry()
    )

Chicago Building Footprints

Download

Download Chicago Building Footprints dataset to identify the age and number of buildings.

Code
chicagoBuildings <- st_read("https://data.cityofchicago.org/api/geospatial/syp8-uezg?fourfour=syp8-uezg&cacheBust=1712775954&date=20240414&accessType=DOWNLOAD&method=export&format=GeoJSON") %>%
    st_transform("EPSG:4269")

Process

Remove buildings with empty geometries or empty street names. Spatial join buildings to the block. Summarize building characteristics at the block level.

Code
sf_use_s2(FALSE)

chicagoBuildingsImp <- chicagoBuildings %>%
    filter(!st_is_empty(.)) %>%
    filter(st_name1 != "") %>%
    mutate(year_built =
        case_when(
            year_built == 0 ~ NA,
            .default = as.integer(year_built)
        )      
    ) %>%
    select(year_built) %>%
    st_join(chicagoBlocks) %>%
    st_drop_geometry() %>%
    left_join(chicagoACS) %>%
    tibble() %>%
    mice(meth = "rf", seed = 234) %>%
    complete() %>%
    as_tibble()

chicagoBldBlk <- chicagoBuildingsImp %>%
    st_drop_geometry() %>%
    group_by(GEOID_blk) %>%
    summarise(
        bldAge_median = 2024 - median(year_built),
        bld_number = n(),
        bldAge_max = 2024 - min(year_built),
        bldAge_mean = 2024 - mean(year_built),
        bldAge_nAft86 = sum(year_built > 1986)
    )

Chicago DWM Lead Tests

Download

Download Department of Water Management lead testing data

Code
pb_test_path <- tempfile()
download.file("https://www.chicagowaterquality.org/DataFiles/wqContent/Results.xlsx", pb_test_path, mode="wb")

chicagoPbTest <- read_excel(path = pb_test_path, skip = 2, sheet = 1) 

Process

Code
chicagoBuildingAddresses <- chicagoBuildings %>%
    filter(!st_is_empty(.)) %>%
    filter(st_name1 != "") %>%
    st_join(chicagoBlocks) %>%
    st_drop_geometry() %>%
    as_tibble() %>%
    arrange(t_add1) %>%
    mutate(ID = row_number()) %>%
    mutate(
        Address = str_pad(t_add1, 2, pad = "0"),
        Address = str_c(str_sub(Address, end = -3), "XX"),
        Address = str_c(Address, pre_dir1, st_name1, st_type1, sep = " "))

chicagoPbBlk <- chicagoPbTest %>%
    filter(!is.na(`Sample Date`)) %>%
    filter(`1st Draw` != "See Follow-Up Sequential Sampling Table for Results") %>%
    mutate(across(c(`1st Draw`, `2/3 Min`, `5 Min`), ~
        case_when(
            str_detect(.x, "<") ~ NA,
            .default = .x
        )
    )) %>%
    mutate(across(c(`1st Draw`, `2/3 Min`, `5 Min`), ~
        case_when(
            is.na(.x) ~ TRUE,
            .default = FALSE
        ),
        .names = "{.col}_lt1"
    )) %>%
    mutate(ID = 
        map_int(Address, ~
            chicagoBuildingAddresses %>%
                filter(Address == .x) %>%
                slice( (n() + (n() %% 2))/2) %>%
                pluck("ID") %>%
                {ifelse(length(.) == 0, NA, .)}
        )
    ) %>% 
    left_join(chicagoBuildingAddresses) %>%
    group_by(GEOID_blk) %>%
    summarize(PB_gt1Pct = 1 - mean(`1st Draw_lt1`), PB_nTests = n()) %>%
    mutate(PB_majoritygt1 = PB_gt1Pct >= 0.5)

Chicago Health Atlas

Code
chicagoHAtract <- request("https://chicagohealthatlas.org/api/action/download/") %>%
    req_method("POST") %>%
    req_body_raw(r"[{"layer":"tract-2020","topics":"EDX-~2018-2022,HDX-~2018-2022,SVI-~2020,CZM-~2018-2022,EVR-~2018,PMC-~2023,LFA-~2019,UNS-~2018-2022,PCI-~2018-2022,DIV-~2018-2022,DIS-~2018-2022,HTJ-~2018-2022,RBU-~2018-2022,VAC-~2018-2022","state":"","within":"","regions":"","place_filters":"","errors":"se","population":false,"lat_long":false,"counties":false,"documentation":false,"format":"csv","insight":"","benchmark":false}]", "application/x-www-form-urlencoded") %>%
    req_perform() %>%
    resp_body_string() %>%
    {read_csv(., col_names = names(read_csv(., n_max = 0)), cols(GEOID = "c"), skip = 2)} %>%
    select(-Layer, -Name) %>%
    rename(GEOID_tract = GEOID) %>%
    drop_na(!ends_with("se"))


chicagoHAPbCA <- request("https://chicagohealthatlas.org/api/action/download/") %>%
    req_method("POST") %>%
    req_body_raw(r"[{"layer":"neighborhood","topics":"LDPP-~2023,LDPPH-~2016","state":"","within":"","regions":"","place_filters":"","errors":"se","population":false,"lat_long":false,"counties":false,"documentation":false,"format":"csv","insight":"","benchmark":false}]", "application/x-www-form-urlencoded") %>%
    req_perform() %>%
    resp_body_string() %>%
    {read_csv(., col_names = names(read_csv(., n_max = 0)), cols(GEOID = "c"), skip = 2)}

chicagoHAblk <- chicagoBlocks %>%
    st_join(
        left_join(chicagoHAPbCA, chicagoCommunityAreas, join_by(GEOID == area_numbe)) %>%
            st_as_sf(), 
        join = st_covered_by, 
        largest = TRUE) %>%
    rename(GEOID_CA = GEOID) %>%
    left_join(chicagoHAtract) %>%
    st_drop_geometry() %>%
    as_tibble() %>%
    select(-Layer, -Name, -community, -area, -shape_area, -perimeter, -area_num_1, -comarea_id, -comarea, -shape_len) %>%
    drop_na(!ends_with("se"))

Chicago Service Line Inventory

Download

Not currently implemented due to time constraints. This an additional data source that I am contemplating using.

Code
resp <- building_blocks %>%
    st_drop_geometry() %>%
    as_tibble() %>%
    filter(year_built != 0) %>%
    slice(1:100) %>% 
    mutate(st_add = str_c(label_hous, pre_dir1, st_name1, st_type1, sep = " ")) %>%
    pull(st_add) %>% 
    map(~
        request("https://sli.chicagowaterquality.org/servicematerialsearch/api/") %>%
            req_method("POST") %>%
            req_body_raw(str_c(r"[{"type":"text","accNum":"-","addr":"]", .x, r"["}]"))
    ) %>%
    req_perform_parallel(on_error = "continue")

resp %>%
    resps_successes() %>%
    keep(map_lgl(., resp_has_body)) %>%
    resps_data(resp_body_json) %>%
    map(~ fromJSON(.x) %>% `[[`(1)) %>%
    map_df(bind_cols)

Joining and Conducting Final Processing

In the end, the following criteria was used to include blocks within the study:

  • Population greater than 0
  • Number of buildings greater than 0
  • Within tract included in the Chicago Health Atlas

This leaves us with:

  • 33,004 total blocks
  • 12,031 blocks with at least one prior DWM test
    • 0.799 blocks have a majority of first draw tests with detected lead levels greater than 1 ppb
Code
pct_calculator <- function(data, category) {
    mutate(data, across(
        .cols = matches(str_c("^", category, "(.*)(?<!(total))E$"), perl = TRUE), 
        .fn = ~ .x / !!sym(str_c(category, "_totalE")),
        .names = "{.col}Pct")) %>%
    select(-matches(str_c("^", category, "(.*)(?<!(total))E$"), perl = TRUE))
}

data <- chicagoACS %>%
    left_join(chicagoPbBlk) %>%
    inner_join(chicagoBldBlk) %>%
    inner_join(chicagoHAblk) %>%
    select(-ends_with("se")) %>%
    relocate(GEOID_CA, PB_gt1Pct, PB_nTests, PB_majoritygt1, .after = GEOID_tract) %>%
    rename_with(.cols = starts_with("race_blk_"), .fn = ~ str_c(.x, "E")) %>%
    filter(race_blk_totalE > 0) %>%
    pct_calculator("race_blk") %>%
    pct_calculator("race_blkGrp") %>%
    pct_calculator("race_tract") %>%
    pct_calculator("education") %>%
    pct_calculator("poverty") %>%
    pct_calculator("computer") %>%
    pct_calculator("internet") %>%
    pct_calculator("plumbing") %>%
    pct_calculator("vacancy") %>%
    pct_calculator("occupied") %>%
    pct_calculator("age") %>%
    pct_calculator("language") %>%
    pct_calculator("native") %>%
    drop_na(-starts_with("PB")) %>%
    select(-matches("^(?!race)(.*)_totalE$", perl = TRUE)) %>%
    rename_with(.cols = starts_with("race"), ~ 
        str_replace(.x, "race", "population") %>%
            str_replace("_totalE", "E")
    )

data %>%
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    View()

Exploratory Data Analysis

Tracts that test more tend to have a higher percent of water tests with greater than 1 PPB lead even when adjusting for population size.

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE)/ n
    ) %>% 
    group_by(n, gt1pct) %>%
    summarize(nTract = n()) %>% 
    ggplot(aes(x = n, y = gt1pct, color = nTract, size = nTract)) +
        geom_point() +
        labs(
            title = "Number of Tests vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Tests (#)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_color_gradient(low = "steelblue1", high = "black") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_tractE[1],
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE)/  sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = nPer100People, y = gt1pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Number of Tests per 100 Residents vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Tests Per 100 Residents (#)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

We can see that residents on the north side have disproportionately low positive lead rates when compared to that of the rest of Chicago

Code
data %>%
    group_by(GEOID_blkGrp) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(block_groups(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_blkGrp == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_blkGrp}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Block Group",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(tracts(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_tract == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_tract}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Tract",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_CA) %>%
    summarize(
        n = sum(PB_nTests, na.rm = TRUE),
        gt1pct = sum(PB_nTests * PB_gt1Pct, na.rm = TRUE) / n
    ) %>% 
    left_join(chicagoCommunityAreas, join_by(GEOID_CA == area_numbe)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", gt1pct)(gt1pct),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{community}</strong><br>Percent of Tests: {round(gt1pct,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", gt1pct),
            values = gt1pct,
            opacity = 0.7,
            title = "Tests with Greater than 1 PPB Lead <br> Detected (%) By Community Area",
            position = "topright",
            na.label = "Insufficient Data",
            labFormat = labelFormat(
                prefix = " ", suffix = " %",
                transform = function(x) 100 * x)
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_blkGrp) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_blkGrpE[1]
    ) %>% 
    left_join(block_groups(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_blkGrp == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", nPer100People)(nPer100People),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_blkGrp}</strong><br>Tests per 100 People: {round(nPer100People,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", nPer100People),
            values = nPer100People,
            opacity = 0.7,
            title = "Tests per 100 People <br>By Block Group",
            position = "topright",
            na.label = "Insufficient Data"
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        nPer100People = 100 *sum(PB_nTests, na.rm = TRUE) / population_tractE[1]
    ) %>% 
    left_join(tracts(state = "IL", county = "Cook", progress_bar = FALSE), join_by(GEOID_tract == GEOID)) %>%
    st_as_sf() %$%
        {leaflet(.) %>%
        addProviderTiles(providers$CartoDB.Positron) %>% 
        addPolygons(
            fillColor = ~ colorNumeric("Blues", nPer100People)(nPer100People),
            fillOpacity = 0.7,
            weight = 0,
            highlightOptions = highlightOptions(
                color = "Black",
                fillOpacity = 1,
                bringToFront = T),
            label = str_glue_data(., "<strong>{GEOID_tract}</strong><br>Tests per 100 People: {round(nPer100People,2)}<br/>") %>% 
                    map(htmltools::HTML),
            labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "12px",
                direction = "auto")
        ) %>%
        addLegend(
            pal = colorNumeric("Blues", nPer100People),
            values = nPer100People,
            opacity = 0.7,
            title = "Tests per 100 People <br>By Tract",
            position = "topright",
            na.label = "Insufficient Data"
        )} %>%
        htmlwidgets::prependContent(htmltools::tags$style(type = "text/css", "div.info.legend.leaflet-control br {clear: both;}"))
Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        bldAge_mean = sum(bldAge_mean * bld_number, na.rm = TRUE)/sum(bld_number, na.rm = TRUE),
        PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE)/sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = bldAge_mean, y = PB_gt1Pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Mean Building Age vs Percent of Tests with Greater than 1 \nPPB Lead Detected by Tract",
            x = "Mean Building Age (years)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

Code
data %>%
    group_by(GEOID_tract) %>%
    summarize(
        bldAge_nBfr86 = 1-((sum(bldAge_nAft86, na.rm = TRUE))/sum(bld_number, na.rm = TRUE)),
        PB_gt1Pct = sum(PB_gt1Pct * PB_nTests, na.rm = TRUE)/sum(PB_nTests, na.rm = TRUE)
    ) %>% 
    ggplot(aes(x = bldAge_nBfr86, y = PB_gt1Pct)) +
        geom_point(color = "steelblue1") +
        labs(
            title = "Percent of Buildings built after 1986 vs Percent of Tests with \nGreater than 1 PPB Lead Detected by Tract",
            x = "Buildings built after 1986 (%)",
            y = "Tests with Greater than 1 PPB Lead Detected (%)") +
        scale_x_continuous(labels = scales::percent) +
        scale_y_continuous(labels = scales::percent) +
        theme(
            plot.title = element_text(hjust = 0.5),
            legend.position="none"
        )

Code
data %>%
    select(where(is.numeric)) %>%
    select(where(function(x) sum(!is.na(x)) != 0)) %>%
    select(where(function(x) var(x, na.rm = TRUE) != 0)) %>% 
    correlate(method = "spearman") %>%
    autoplot()

Modeling

Code
tidymodels_prefer()

Creating Test/Training Data

Code
set.seed(345) 
model_fitting_data <- data %>%
    mutate(PB_nTests = 
        case_when(
            is.na(PB_nTests) ~ 5,
            .default = PB_nTests
        )
    ) %>%
    filter(!is.na(PB_majoritygt1)) %>%
    mutate(PB_majoritygt1 = as.factor(PB_majoritygt1)) %>%
    filter(GEOID_CA == "22") %>%
    select(-PB_gt1Pct, -GEOID_blk, -GEOID_CA)

split <- initial_split(model_fitting_data, strata = PB_majoritygt1)
train <- training(split)
test  <- testing(split)

set.seed(456) 
resamples <- vfold_cv(train, v = 5, strata = PB_majoritygt1)

Creating Our Recipes

Code
default_recipe <- model_fitting_data %>%    
    recipe(PB_majoritygt1 ~ .) %>%   
    step_novel(all_nominal_predictors()) %>%   
    step_dummy(all_nominal_predictors()) %>%
    step_zv(all_predictors()) %>%
    step_smote(all_outcomes())

normalized_recipe <- default_recipe %>%
    step_normalize(all_predictors())

Creating Our Models

Code
logistic_class_spec <- 
    logistic_reg(penalty = tune(), mixture = tune()) %>% 
    set_engine("glmnet") %>% 
    set_mode("classification")

lightGBM_class_spec <- 
    boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), min_n = tune(), sample_size = tune(), trees = tune()) %>%
    set_engine("lightgbm") %>%
    set_mode("classification")

rf_class_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

Creating Workflow

Code
normalized <- workflow_set(
    preproc = list(normalized = normalized_recipe),
    models = list(logistic = logistic_class_spec))

no_pre_proc <- workflow_set(
    preproc = list(simple = default_recipe), 
    models = list(RF = rf_class_spec, boosting = lightGBM_class_spec))
Code
all_workflows <- bind_rows(no_pre_proc, normalized)

Tune Model

Code
conflicted::conflicts_prefer(purrr::flatten)
conflicted::conflicts_prefer(purrr::set_names)

grid_results <- all_workflows %>%
  workflow_map(
    "tune_race_anova",
    seed = 567,
    resamples = resamples,
    grid = 30,
    control = control_race(
      save_pred = FALSE,
      save_workflow = FALSE,
      parallel_over = "everything"))
Code
autoplot(grid_results)

Select Best Model

Code
conflicted::conflicts_prefer(dplyr::filter)

best_wf <- grid_results %>% 
    rank_results() %>%
    filter(.metric == "roc_auc", rank == 1) %>%
    pull(wflow_id)

best_results <- grid_results %>%
    extract_workflow_set_result(best_wf) %>% 
    select_best(metric = "roc_auc") 

best_results_fit <- 
    grid_results %>% 
    extract_workflow(best_wf) %>%
    finalize_workflow(best_results) %>% 
    last_fit(split = split)
Code
bayes_results <- grid_results %>% 
    extract_workflow(best_wf) %>%
    tune_bayes(
        resamples = resamples,
        initial = 20,
        iter = 25,
        control = control_bayes(verbose = TRUE)
    )
Code
bayes_results %>%
  show_best(metric = "roc_auc") %>% 
  knitr::kable()
trees min_n tree_depth learn_rate loss_reduction sample_size .metric .estimator mean n std_err .config .iter
633 2 13 0.0017964 0.0000367 0.3963353 roc_auc binary 0.7191007 5 0.0337507 Iter9 9
1290 33 12 0.0004314 0.0478811 0.3810152 roc_auc binary 0.7188557 5 0.0486365 Preprocessor1_Model17 0
1186 21 12 0.0012358 0.0000000 0.3930846 roc_auc binary 0.7157183 5 0.0400548 Iter16 16
932 3 8 0.0000000 3.4591221 0.2315099 roc_auc binary 0.7154971 5 0.0359560 Preprocessor1_Model01 0
1737 21 7 0.0000000 13.4263931 0.2740644 roc_auc binary 0.7143354 5 0.0459334 Iter10 10
Code
final_fit <- grid_results %>% 
    extract_workflow(best_wf) %>%
    finalize_workflow(select_best(bayes_results, metric = "roc_auc")) %>% 
    last_fit(split = split)
Code
final_fit %>%
    collect_predictions() %>%
    roc_curve(PB_majoritygt1, .pred_FALSE) %>%
    autoplot()

Code
final_fit %>% 
    collect_metrics() %>%
    knitr::kable()
.metric .estimator .estimate .config
accuracy binary 0.6382979 Preprocessor1_Model1
roc_auc binary 0.6828125 Preprocessor1_Model1
Code
final_fit %>%
    collect_predictions() %>%
    conf_mat(PB_majoritygt1, .pred_class) %>%
    autoplot(type = "heatmap")

References

[1]
B. Q. Huynh, E. T. Chin, and M. V. Kiang, “Estimated Childhood Lead Exposure From Drinking Water in Chicago,” JAMA pediatrics, p. e240133, Mar. 2024, doi: 10.1001/jamapediatrics.2024.0133.